In this homework assignment, we will explore, analyze and model a data set containing information on approximately 12,000 commercially available wines. The variables are mostly related to the chemical properties of the wine being sold. The response variable is the number of sample cases of wine that were purchased by wine distribution companies after sampling a wine.
Initially, we will examine the data for any problems that may exist, such as missing data, outliers, and multicollinearity. Afterward, we'll take the necessary steps to clean the data and build two Poisson regression, two negative binomial regression, and two multiple linear regression models using the training dataset. We will train the models and evaluate them based on how well they perform against the provided evaluation data. Finally, we will select a final model that provides the best balance between accuracy and simplicity for predicting the number of cases of wine that will be sold given certain properties of the wine.
The training dataset contained 12,795 observations, where each record represented a commercially available wine. The variables included:
The evaluation dataset contained 3335 observations over the same predictor variables.
- Target: number of cases purchased
- AcidIndex: proprietary method of testing the total acidity of wine by using a weighted average
- Alcohol: alcohol content
- Chlorides: chloride content of wine
- CitricAcid: citric acid content
- Density: density of wine
- FixedAcidity: fixed acidity of wine
- FreeSulfurDioxide: sulfur dioxide content of wine
- LabelAppeal: marketing score indicating the appeal of the label design for consumers. High numbers suggest customers like the label design; negative numbers suggest they don't.
- ResidualSugar: residual sugar of wine
- Stars: wine rating by a team of experts (4 stars = Excellent, 1 star = Poor). A high number of stars suggests high sales.
- Sulphates: sulfate content of wine
- TotalSulfurDioxide: total sulfur dioxide of wine
- VolatileAcidity: volatile acid content of wine
- pH: pH of wine

In order to explore summary stats and distribution characteristics of our dataset, we'll first need to conduct some basic transformations and cleanup:
- target, a numeric variable indicating the number of cases purchased, appears only in the training data.
- The evaluation dataset lacks target, suggesting this data might be used for prediction rather than validation and evaluation of model performance. For clarity we'll rename this dataset 'prediction' instead and create a separate validation hold-out from the training data.
- An index column labels the observations and can be excluded from the models.

While exploring this data, we made the following observations:

- A few variables take only a small set of discrete values (stars, labelappeal, acidindex, target).
- target (number of cases purchased) varies between 0 and 8.
8| variable | complete_rate | n_missing | min | max |
|---|---|---|---|---|
| acidindex | 1.00 | 0 | 4.00 | 17.00 |
| alcohol | 0.95 | 838 | -4.70 | 26.50 |
| chlorides | 0.95 | 776 | -1.17 | 1.35 |
| citricacid | 1.00 | 0 | -3.24 | 3.86 |
| density | 1.00 | 0 | 0.89 | 1.10 |
| fixedacidity | 1.00 | 0 | -18.20 | 34.40 |
| freesulfurdioxide | 0.95 | 799 | -563.00 | 623.00 |
| labelappeal | 1.00 | 0 | -2.00 | 2.00 |
| ph | 0.97 | 499 | 0.48 | 6.21 |
| residualsugar | 0.95 | 784 | -128.30 | 145.40 |
| stars | 0.74 | 4200 | 1.00 | 4.00 |
| sulphates | 0.91 | 1520 | -3.13 | 4.24 |
| totalsulfurdioxide | 0.95 | 839 | -823.00 | 1057.00 |
| volatileacidity | 1.00 | 0 | -2.83 | 3.68 |
One of the first characteristics that stands out is the presence of negative values for many chemical compounds, together with the relative normality of their distributions. This suggests they have already been power-transformed to produce normal distributions for modeling.
Variables related to sugars, chlorides, acidity, sulfites and sulfates all seem to fall into this category. Considering that we are measuring very small amounts of chemical compounds, we might assume their natural distributions were highly skewed.
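To make this transformation hypothesis concrete, here is a minimal sketch (in Python for illustration only; the simulated data and thresholds are our own assumptions, not the assignment's) showing how a log-type power transform turns a strictly positive, right-skewed concentration into a roughly normal variable that contains negative values, matching what we observe:

```python
import math
import random
import statistics

random.seed(1)

# Simulate a right-skewed, strictly positive "compound concentration"
# (lognormal), mimicking what the raw chemistry may have looked like.
raw = [random.lognormvariate(0.0, 1.0) for _ in range(10_000)]

# After a log transform the distribution is approximately normal, and any
# concentration below 1 unit maps to a negative value -- consistent with
# the negative readings in the summary table above.
logged = [math.log(x) for x in raw]

# Mean well above the median signals right skew; after logging the two
# nearly coincide, and negatives appear.
print(statistics.mean(raw) > 1.3 * statistics.median(raw))              # True
print(abs(statistics.mean(logged) - statistics.median(logged)) < 0.05)  # True
print(min(logged) < 0 < max(logged))                                    # True
```

This also explains why we cannot simply exponentiate the provided values back: without knowing the exact transform and any centering/scaling applied, the original concentrations are not recoverable.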
The variable acidindex (a proprietary method of testing the total acidity of wine using a weighted average) appears slightly right-skewed.
The boxplots below don’t show any outliers that we need to deal with.
We also examined the relationship between stars/labelappeal and the target variable. As labelappeal increases, target also increases (when the marketing score for a label's design rises, so does the number of cases purchased). stars = NA correlates with low values of target. Taking the hint from the assignment that a missing value can itself be predictive of the target, we won't discard the missing values for stars.
The plots below help us understand the relationship between the independent variables and the target variable.
In the correlation plot below, we see that stars and labelappeal are the most positively correlated with the response variable (more stars or a more appealing label both mean more purchases). There is a slight negative correlation between acidindex and the target variable. In terms of multicollinearity, we don't see high correlations between predictors, so we may not need to address it in our models.
We tried back-transforming the sugar, chloride, acidity, sulfite and sulfate variables by exponentiating with the natural base and other values, but did not arrive at an obvious or consistent transformation, so we may not be able to interpret model results on the scale of the original values for these variables.
Next we’ll find and impute any missing data. There are 8 predictor variables that contain NAs:
| variable | is_na | pct |
|---|---|---|
| stars | 4200 | 0.26 |
| sulphates | 1520 | 0.09 |
| residualsugar | 784 | 0.05 |
| chlorides | 776 | 0.05 |
| freesulfurdioxide | 799 | 0.05 |
| totalsulfurdioxide | 839 | 0.05 |
| alcohol | 838 | 0.05 |
| ph | 499 | 0.03 |
Heeding the warning in the assignment, “sometimes, the fact that a variable is missing is actually predictive of the target”, we’ll consider each of these variables carefully. While there may be data “missing completely at random” (MCAR) that we wish to impute, this may not always be the case.
The pattern in stars suggests that, out of roughly 16,000 wine samples, about a quarter have never been professionally reviewed. If we assume the existence of a review has some impact on the sales of a wine brand (whatever the reviewer's sentiment), then imputing mean or predicted values here might distort our model. Therefore, we'll simply preserve the NAs, as the model functions automatically adjust for these observations.
However, we’ll convert Stars from a numeric to a factor
to enable further analysis.
Next we consider some of the missing chemical compounds in our wines;
alcohol, sugars, chlorides, sulfites and sulfates, and measures such as
ph.
First, we can safely assume that all wines in this dataset have an actual ph score greater than zero (zero would represent the most acidic rank, such as powerful industrial acids), so we'll want to impute more reasonable values for these.
Based on some reading into the organic wines segment, there is a growing demand in the market for specialty products such as low-sulfite, low-sugar and low-alcohol wines. However, this still represents a very small segment of the overall market, and chemically it’s not likely for these compounds to be completely absent from the final product.
Additionally, the predictors freesulfurdioxide and totalsulfurdioxide are linked: the 'free' SO2 in wine is always a subset of the 'total' SO2 present. We identified only 59 cases where both of these values were NA, while over 1,500 cases had a missing value for only one or the other.
Based on these observations, we'll use the MICE imputation method to predict and impute the missing values for residualsugar, chlorides, freesulfurdioxide, totalsulfurdioxide, sulphates, alcohol and ph.
Target/source labels and non-chemical predictors
labelappeal and stars were excluded as
predictors for the imputation.
labelappeal is a numeric score of consumer ratings for a
wine brand’s label design. It has also been pre-transformed to produce a
normal distribution for modeling; however this is a very sparse variable
with nearly half the cases having a value of zero.
This may be a candidate for handling with zero-inflated models. We won't change the values here, but will convert labelappeal from a numeric to a factor.
We now have reasonably imputed values, and nearly-normal
distributions for our numeric predictors, taking special note of the
frequency of zero values for labelappeal and
stars.
| variable | n_missing | n_zero |
|---|---|---|
| acidindex | 0 | 0 |
| alcohol | 0 | 5 |
| chlorides | 0 | 9 |
| citricacid | 0 | 151 |
| density | 0 | 0 |
| fixedacidity | 0 | 47 |
| freesulfurdioxide | 0 | 14 |
| labelappeal | 0 | 7087 |
| ph | 0 | 0 |
| residualsugar | 0 | 6 |
| stars | 4200 | NA |
| sulphates | 0 | 32 |
| totalsulfurdioxide | 0 | 11 |
| volatileacidity | 0 | 22 |
With transformations complete, we split back into training and
prediction datasets based on our source_flag, and create a
15% validation hold-out from the training data.
Poisson Regression assumes that the variance and mean of our
dependent variable target are roughly equal, otherwise we
may be looking at over- or under-dispersion.
pr1 <- glm(target ~ ., family = 'poisson', data = df_train)
##
## Call:
## glm(formula = target ~ ., family = "poisson", data = df_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.162e+00 2.290e-01 5.075 3.88e-07 ***
## fixedacidity 7.689e-04 9.341e-04 0.823 0.410419
## volatileacidity -2.510e-02 7.430e-03 -3.379 0.000728 ***
## citricacid 4.399e-04 6.736e-03 0.065 0.947927
## residualsugar 4.273e-05 1.723e-04 0.248 0.804120
## chlorides -2.557e-02 1.849e-02 -1.383 0.166768
## freesulfurdioxide 4.841e-05 3.917e-05 1.236 0.216533
## totalsulfurdioxide 3.310e-05 2.528e-05 1.309 0.190441
## density -3.003e-01 2.208e-01 -1.360 0.173821
## ph 1.317e-03 8.682e-03 0.152 0.879418
## sulphates -5.790e-03 6.276e-03 -0.922 0.356283
## alcohol 4.930e-03 1.569e-03 3.143 0.001673 **
## labelappeal-1 2.478e-01 4.829e-02 5.132 2.86e-07 ***
## labelappeal0 4.807e-01 4.718e-02 10.189 < 2e-16 ***
## labelappeal1 6.331e-01 4.781e-02 13.241 < 2e-16 ***
## labelappeal2 7.787e-01 5.253e-02 14.823 < 2e-16 ***
## acidindex -4.902e-02 5.311e-03 -9.231 < 2e-16 ***
## stars2 3.138e-01 1.559e-02 20.126 < 2e-16 ***
## stars3 4.275e-01 1.710e-02 24.999 < 2e-16 ***
## stars4 5.368e-01 2.347e-02 22.871 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 7273.5 on 8015 degrees of freedom
## Residual deviance: 4805.2 on 7996 degrees of freedom
## (2881 observations deleted due to missingness)
## AIC: 28734
##
## Number of Fisher Scoring iterations: 5
| AIC | 28733.84 |
| Dispersion | 0.42 |
| Log-Lik | -14346.92 |
We note that our model has generated 'dummies' from our categorical variables labelappeal and stars; of the 19 resulting coefficients (excluding the intercept), ten are statistically significant.
Notably, our Dispersion Parameter is 0.42, which suggests a degree of under-dispersion in the data.
By graphing our target values (green) against our predicted values (blue) we can easily see this model tends to under-predict the higher count levels, and wildly over-predict the lower count levels.
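For reference, the dispersion statistic reported in the table is the sum of squared Pearson residuals divided by the residual degrees of freedom; values near 1 indicate equidispersion, below 1 under-dispersion, and above 1 over-dispersion. A minimal sketch of the calculation (in Python for illustration; the toy counts and function name are our own, not output from the fitted model):

```python
def poisson_dispersion(y, mu, n_params):
    """Pearson dispersion statistic for a Poisson fit:
    sum of squared Pearson residuals divided by residual degrees
    of freedom (n - number of estimated parameters)."""
    pearson = sum((yi - mi) ** 2 / mi for yi, mi in zip(y, mu))
    return pearson / (len(y) - n_params)

# Toy example: observed counts clustered tightly around the fitted mean
# (variance < mean), so the statistic falls well below 1 -- the same
# under-dispersion pattern we see for target.
y  = [3, 4, 3, 4, 3, 4, 3, 4, 3, 4]
mu = [3.5] * 10
print(round(poisson_dispersion(y, mu, n_params=1), 3))  # 0.079
```

Because the Poisson model fixes the dispersion at 1, a statistic of 0.42 means the model's standard errors are conservative and its predicted counts spread out more than the data do.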
We’ll build a Zero-Inflated Poisson model to handle the large number
of zero values in our labelappeal and stars
predictors, to see if we can improve model accuracy.
library(pscl)  # provides zeroinfl()
pr2 <- zeroinfl(target ~ . | ., data=df_train, dist = 'poisson')
##
## Call:
## zeroinfl(formula = target ~ . | ., data = df_train, dist = "poisson")
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -2.17641 -0.27896 0.04181 0.35555 3.81291
##
## Count model coefficients (poisson with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 9.959e-01 2.343e-01 4.250 2.14e-05 ***
## fixedacidity 5.690e-04 9.565e-04 0.595 0.551905
## volatileacidity -1.139e-02 7.570e-03 -1.504 0.132552
## citricacid -7.126e-04 6.845e-03 -0.104 0.917085
## residualsugar -5.670e-05 1.746e-04 -0.325 0.745375
## chlorides -1.721e-02 1.887e-02 -0.912 0.361875
## freesulfurdioxide 1.035e-05 3.957e-05 0.262 0.793581
## totalsulfurdioxide -7.379e-06 2.508e-05 -0.294 0.768616
## density -3.055e-01 2.261e-01 -1.351 0.176636
## ph 7.151e-03 8.840e-03 0.809 0.418593
## sulphates 1.832e-03 6.368e-03 0.288 0.773549
## alcohol 6.182e-03 1.590e-03 3.888 0.000101 ***
## labelappeal-1 3.000e-01 5.050e-02 5.940 2.85e-09 ***
## labelappeal0 5.688e-01 4.935e-02 11.525 < 2e-16 ***
## labelappeal1 7.471e-01 4.999e-02 14.945 < 2e-16 ***
## labelappeal2 9.034e-01 5.460e-02 16.545 < 2e-16 ***
## acidindex -1.730e-02 5.572e-03 -3.104 0.001907 **
## stars2 1.313e-01 1.640e-02 8.003 1.21e-15 ***
## stars3 2.320e-01 1.783e-02 13.013 < 2e-16 ***
## stars4 3.341e-01 2.406e-02 13.888 < 2e-16 ***
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.010e+01 2.736e+00 -3.691 0.000224 ***
## fixedacidity -9.712e-03 1.070e-02 -0.908 0.363916
## volatileacidity 3.231e-01 8.611e-02 3.752 0.000176 ***
## citricacid -6.911e-03 7.743e-02 -0.089 0.928880
## residualsugar -3.192e-03 2.001e-03 -1.596 0.110549
## chlorides 1.986e-01 2.212e-01 0.898 0.369129
## freesulfurdioxide -1.108e-03 4.600e-04 -2.409 0.016001 *
## totalsulfurdioxide -1.158e-03 2.937e-04 -3.941 8.12e-05 ***
## density 1.800e+00 2.604e+00 0.691 0.489368
## ph 1.747e-01 1.015e-01 1.721 0.085183 .
## sulphates 2.080e-01 7.249e-02 2.869 0.004121 **
## alcohol 2.712e-02 1.755e-02 1.545 0.122236
## labelappeal-1 5.634e-01 6.157e-01 0.915 0.360146
## labelappeal0 1.264e+00 5.979e-01 2.114 0.034477 *
## labelappeal1 2.022e+00 6.029e-01 3.354 0.000795 ***
## labelappeal2 2.721e+00 6.813e-01 3.994 6.51e-05 ***
## acidindex 5.649e-01 4.863e-02 11.616 < 2e-16 ***
## stars2 -3.703e+00 3.544e-01 -10.450 < 2e-16 ***
## stars3 -1.842e+01 3.436e+02 -0.054 0.957251
## stars4 -1.843e+01 6.972e+02 -0.026 0.978916
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Number of iterations in BFGS optimization: 50
## Log-likelihood: -1.387e+04 on 40 Df
| AIC | 27818.73 |
| Dispersion | 0.31 |
| Log-Lik | -13869.37 |
Using a zero-inflated model, the dispersion parameter drops significantly, and we get a better overall result for counts of 3 or more. By graphing our target values (green) against our predicted values (blue) we can see much greater accuracy for most of the mid and upper counts.
Notably, we are still under-predicting counts of 1 and 2, and greatly over-predicting counts of zero.
Generally, we would use Negative Binomial Regression in cases of over-dispersion (where the variance of our dependent variable is significantly greater than the mean.) This does not appear to be the case with our dataset, but we’ll apply it here and examine the results:
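As context for the results below: under the NB2 parameterization the variance is Var(Y) = mu + mu^2/theta, so a very large estimated theta makes the extra term vanish and collapses the negative binomial back to the Poisson. A small sketch (in Python for illustration; the mu value is hypothetical, while theta = 137,390 is taken from the fit below):

```python
def negbin_variance(mu, theta):
    """NB2 variance function: Var(Y) = mu + mu^2 / theta.
    As theta -> infinity the quadratic term vanishes and
    Var(Y) -> mu, which is exactly the Poisson assumption."""
    return mu + mu ** 2 / theta

mu = 3.0
print(negbin_variance(mu, theta=1.0))      # 12.0 -- strong over-dispersion
print(negbin_variance(mu, theta=137_390))  # just above 3.0 -- essentially Poisson
```

This is why the huge theta estimate (with an iteration-limit warning) below signals that the data give the negative binomial no over-dispersion to model.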
library(MASS)  # provides glm.nb()
nb1 <- glm.nb(target ~ ., data = df_train)
##
## Call:
## glm.nb(formula = target ~ ., data = df_train, init.theta = 137389.9036,
## link = log)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.162e+00 2.290e-01 5.075 3.88e-07 ***
## fixedacidity 7.689e-04 9.341e-04 0.823 0.410422
## volatileacidity -2.510e-02 7.430e-03 -3.379 0.000728 ***
## citricacid 4.399e-04 6.736e-03 0.065 0.947928
## residualsugar 4.273e-05 1.723e-04 0.248 0.804114
## chlorides -2.557e-02 1.849e-02 -1.383 0.166771
## freesulfurdioxide 4.841e-05 3.917e-05 1.236 0.216537
## totalsulfurdioxide 3.310e-05 2.528e-05 1.309 0.190443
## density -3.003e-01 2.208e-01 -1.360 0.173824
## ph 1.317e-03 8.682e-03 0.152 0.879427
## sulphates -5.790e-03 6.276e-03 -0.922 0.356283
## alcohol 4.930e-03 1.569e-03 3.143 0.001674 **
## labelappeal-1 2.478e-01 4.829e-02 5.132 2.86e-07 ***
## labelappeal0 4.807e-01 4.718e-02 10.189 < 2e-16 ***
## labelappeal1 6.331e-01 4.781e-02 13.241 < 2e-16 ***
## labelappeal2 7.787e-01 5.253e-02 14.823 < 2e-16 ***
## acidindex -4.902e-02 5.311e-03 -9.231 < 2e-16 ***
## stars2 3.138e-01 1.559e-02 20.126 < 2e-16 ***
## stars3 4.275e-01 1.710e-02 24.999 < 2e-16 ***
## stars4 5.368e-01 2.347e-02 22.870 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(137389.9) family taken to be 1)
##
## Null deviance: 7273.4 on 8015 degrees of freedom
## Residual deviance: 4805.2 on 7996 degrees of freedom
## (2881 observations deleted due to missingness)
## AIC: 28736
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 137390
## Std. Err.: 199804
## Warning while fitting theta: iteration limit reached
##
## 2 x log-likelihood: -28693.98
| AIC | 28735.98 |
| Dispersion | 0.42 |
| Log-Lik | -14346.99 |
As expected, the Negative Binomial Regression does not outperform the Poisson.
We’ll build a Zero-Inflated Negative Binomial model to handle the
large number of zero values in our labelappeal and
stars predictors, to see if we can improve model
accuracy.
nb2 <- zeroinfl(target ~ . | ., data=df_train, dist = 'negbin')
##
## Call:
## zeroinfl(formula = target ~ . | ., data = df_train, dist = "negbin")
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -2.17642 -0.27896 0.04182 0.35556 3.81280
##
## Count model coefficients (negbin with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 9.960e-01 2.343e-01 4.250 2.14e-05 ***
## fixedacidity 5.688e-04 9.565e-04 0.595 0.552038
## volatileacidity -1.139e-02 7.570e-03 -1.504 0.132580
## citricacid -7.134e-04 6.845e-03 -0.104 0.916996
## residualsugar -5.673e-05 1.746e-04 -0.325 0.745266
## chlorides -1.721e-02 1.887e-02 -0.912 0.361838
## freesulfurdioxide 1.036e-05 3.957e-05 0.262 0.793415
## totalsulfurdioxide -7.381e-06 2.508e-05 -0.294 0.768566
## density -3.056e-01 2.261e-01 -1.352 0.176524
## ph 7.151e-03 8.840e-03 0.809 0.418547
## sulphates 1.833e-03 6.368e-03 0.288 0.773476
## alcohol 6.182e-03 1.590e-03 3.888 0.000101 ***
## labelappeal-1 2.999e-01 5.049e-02 5.940 2.85e-09 ***
## labelappeal0 5.687e-01 4.935e-02 11.525 < 2e-16 ***
## labelappeal1 7.471e-01 4.999e-02 14.945 < 2e-16 ***
## labelappeal2 9.034e-01 5.460e-02 16.546 < 2e-16 ***
## acidindex -1.730e-02 5.572e-03 -3.104 0.001908 **
## stars2 1.313e-01 1.640e-02 8.003 1.21e-15 ***
## stars3 2.320e-01 1.783e-02 13.013 < 2e-16 ***
## stars4 3.341e-01 2.406e-02 13.888 < 2e-16 ***
## Log(theta) 1.750e+01 3.667e+00 4.772 1.83e-06 ***
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.010e+01 2.736e+00 -3.691 0.000224 ***
## fixedacidity -9.713e-03 1.070e-02 -0.908 0.363850
## volatileacidity 3.231e-01 8.611e-02 3.752 0.000175 ***
## citricacid -6.918e-03 7.743e-02 -0.089 0.928804
## residualsugar -3.193e-03 2.000e-03 -1.596 0.110456
## chlorides 1.986e-01 2.212e-01 0.898 0.369105
## freesulfurdioxide -1.108e-03 4.600e-04 -2.408 0.016039 *
## totalsulfurdioxide -1.158e-03 2.937e-04 -3.941 8.10e-05 ***
## density 1.802e+00 2.604e+00 0.692 0.488998
## ph 1.748e-01 1.015e-01 1.722 0.085113 .
## sulphates 2.080e-01 7.249e-02 2.870 0.004109 **
## alcohol 2.711e-02 1.755e-02 1.545 0.122296
## labelappeal-1 5.607e-01 6.142e-01 0.913 0.361246
## labelappeal0 1.261e+00 5.964e-01 2.115 0.034412 *
## labelappeal1 2.020e+00 6.014e-01 3.359 0.000784 ***
## labelappeal2 2.718e+00 6.799e-01 3.997 6.41e-05 ***
## acidindex 5.649e-01 4.863e-02 11.616 < 2e-16 ***
## stars2 -3.703e+00 3.544e-01 -10.449 < 2e-16 ***
## stars3 -1.842e+01 3.442e+02 -0.054 0.957324
## stars4 -1.843e+01 6.976e+02 -0.026 0.978925
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Theta = 39804271.402
## Number of iterations in BFGS optimization: 58
## Log-likelihood: -1.387e+04 on 41 Df
| AIC | 27820.73 |
| Dispersion | 0.31 |
| Log-Lik | -13869.37 |
The Zero-Inflated Negative Binomial model shows a similar improvement to the Zero-Inflated Poisson but, as before, does not outperform its Poisson counterpart.
For our first Multiple Linear Regression, we’ll use all predictors.
lm1 <- lm(target ~ ., data=df_train)
##
## Call:
## lm(formula = target ~ ., data = df_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.1886 -0.5311 0.0995 0.7384 3.2254
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.771e+00 4.955e-01 7.610 3.05e-14 ***
## fixedacidity 2.894e-03 2.041e-03 1.418 0.1563
## volatileacidity -9.511e-02 1.620e-02 -5.871 4.50e-09 ***
## citricacid 2.022e-03 1.476e-02 0.137 0.8910
## residualsugar 1.683e-04 3.765e-04 0.447 0.6549
## chlorides -9.961e-02 4.044e-02 -2.463 0.0138 *
## freesulfurdioxide 1.703e-04 8.545e-05 1.993 0.0463 *
## totalsulfurdioxide 1.234e-04 5.506e-05 2.240 0.0251 *
## density -1.090e+00 4.827e-01 -2.259 0.0239 *
## ph 8.733e-03 1.895e-02 0.461 0.6449
## sulphates -1.968e-02 1.372e-02 -1.435 0.1515
## alcohol 1.875e-02 3.409e-03 5.502 3.87e-08 ***
## labelappeal-1 5.110e-01 7.801e-02 6.551 6.07e-11 ***
## labelappeal0 1.220e+00 7.630e-02 15.991 < 2e-16 ***
## labelappeal1 1.860e+00 7.892e-02 23.562 < 2e-16 ***
## labelappeal2 2.628e+00 9.843e-02 26.701 < 2e-16 ***
## acidindex -1.714e-01 1.101e-02 -15.567 < 2e-16 ***
## stars2 9.729e-01 3.094e-02 31.449 < 2e-16 ***
## stars3 1.475e+00 3.626e-02 40.671 < 2e-16 ***
## stars4 2.072e+00 5.662e-02 36.602 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.137 on 7996 degrees of freedom
## (2881 observations deleted due to missingness)
## Multiple R-squared: 0.4642, Adjusted R-squared: 0.463
## F-statistic: 364.7 on 19 and 7996 DF, p-value: < 2.2e-16
| AIC | 24825.60 |
| Adj R2 | 0.46 |
…
For our second Multiple Linear Regression, we’ll add stepwise feature selection.
lm2_all <- lm(target ~ ., data=df_train)
lm2 <- stepAIC(lm2_all, trace=FALSE, direction='both')
##
## Call:
## lm(formula = target ~ fixedacidity + volatileacidity + chlorides +
## freesulfurdioxide + totalsulfurdioxide + density + sulphates +
## alcohol + labelappeal + acidindex + stars, data = df_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.1881 -0.5326 0.1010 0.7350 3.2349
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.803e+00 4.911e-01 7.744 1.08e-14 ***
## fixedacidity 2.897e-03 2.040e-03 1.420 0.1557
## volatileacidity -9.509e-02 1.619e-02 -5.874 4.41e-09 ***
## chlorides -9.998e-02 4.043e-02 -2.473 0.0134 *
## freesulfurdioxide 1.705e-04 8.542e-05 1.996 0.0460 *
## totalsulfurdioxide 1.236e-04 5.503e-05 2.247 0.0247 *
## density -1.091e+00 4.824e-01 -2.261 0.0238 *
## sulphates -1.966e-02 1.372e-02 -1.433 0.1518
## alcohol 1.871e-02 3.407e-03 5.492 4.10e-08 ***
## labelappeal-1 5.115e-01 7.799e-02 6.558 5.78e-11 ***
## labelappeal0 1.221e+00 7.629e-02 16.000 < 2e-16 ***
## labelappeal1 1.860e+00 7.891e-02 23.570 < 2e-16 ***
## labelappeal2 2.629e+00 9.839e-02 26.725 < 2e-16 ***
## acidindex -1.717e-01 1.096e-02 -15.667 < 2e-16 ***
## stars2 9.730e-01 3.092e-02 31.465 < 2e-16 ***
## stars3 1.475e+00 3.625e-02 40.685 < 2e-16 ***
## stars4 2.073e+00 5.660e-02 36.622 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.137 on 7999 degrees of freedom
## (2881 observations deleted due to missingness)
## Multiple R-squared: 0.4642, Adjusted R-squared: 0.4631
## F-statistic: 433.1 on 16 and 7999 DF, p-value: < 2.2e-16
| AIC | 24820.04 |
| Adj R2 | 0.46 |
…
We used zipath() to fit zero-inflated Poisson regression models with lasso-regularized variable selection. The count and logit components both start with all the predictor variables, and we kept the coefficient set that produced the smallest AIC for the model.
| count | zero | |
|---|---|---|
| (Intercept) | 1.0410 | -5.4758 |
| fixedacidity | 0.0006 | 0.0000 |
| volatileacidity | -0.0119 | 0.1949 |
| citricacid | -0.0009 | 0.0000 |
| residualsugar | 0.0000 | 0.0000 |
| chlorides | -0.0177 | 0.0000 |
| freesulfurdioxide | 0.0000 | -0.0002 |
| totalsulfurdioxide | 0.0000 | -0.0007 |
| density | -0.3270 | 0.0000 |
| ph | 0.0059 | 0.0000 |
| sulphates | 0.0012 | 0.0580 |
| alcohol | 0.0060 | 0.0000 |
| labelappeal-1 | 0.2845 | -0.2619 |
| labelappeal0 | 0.5471 | 0.0000 |
| labelappeal1 | 0.7245 | 0.3875 |
| labelappeal2 | 0.8797 | 0.3212 |
| acidindex | -0.0168 | 0.4609 |
| stars2 | 0.1340 | -2.4718 |
| stars3 | 0.2321 | -3.3585 |
| stars4 | 0.3351 | -2.6136 |
Theta Estimate:
The coefficients for the count model that survive the regularization process include the dummy variables for labelappeal and stars, plus density:

- labelappeal: the dummy variables derived from label appeal are strong indicators of the number of cases that will be purchased.
- stars: the dummy variables derived from wine ratings are strong indicators of the number of cases that will be purchased.
- density: a negative indicator of the number of cases purchased by distributors, suggesting that lighter wines are more popular than full-bodied wines.
- residualsugar, totalsulfurdioxide and freesulfurdioxide were shrunk to zero by the lasso penalty, and fixedacidity is effectively negligible.

The coefficients for the zero-inflation model that survive the regularization process include stars, labelappeal-1, labelappeal1, and volatileacidity:

- stars: the dummy variables derived from wine ratings are strong indicators of the number of cases that will be purchased.
- labelappeal-1 is negative and labelappeal1 is positive, with no other label-related dummy variable carrying comparable weight in the model. This would suggest that label aesthetics only count at the margins between positive and negative customer sentiment.
- volatileacidity is the only other chemical variable retained in the final model; its sign indicates that lower volatile acid content is preferred when making a purchasing decision.

…
| Model | mape | smape | mase | mpe | RMSE | AIC | Adjusted R2 | F-statistic |
|---|---|---|---|---|---|---|---|---|
| Poisson Regression 1 | NaN | NaN | 1.0339 | NaN | 2.6022 | 28733.84 | NA | NA |
| Poisson Regression 2 | NaN | NaN | 0.4355 | NaN | 1.4496 | 27818.73 | NA | NA |
| Negative Binomial 1 | NaN | NaN | 1.0339 | NaN | 2.6022 | 28735.98 | NA | NA |
| Negative Binomial 2 | NaN | NaN | 0.4355 | NaN | 1.4496 | 27820.73 | NA | NA |
| Multiple Linear Model 1 | Inf | 32.9110 | 0.5009 | -Inf | 2.6022 | 24825.60 | 0.4630 | 364.6602 |
| Multiple Linear Model 2 | Inf | 32.9116 | 0.5009 | -Inf | 1.1665 | 24820.04 | 0.4631 | 433.1453 |
| Lasso | Inf | 32.5251 | 0.4906 | -Inf | 1.1563 | 27948.07 | NA | NA |
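The NaN/Inf entries for mape and mpe arise because both metrics divide by each actual value, and target contains zeros; smape divides by the average magnitude of actual and forecast instead, which is why it stays finite. A minimal sketch (in Python for illustration; these definitions are common textbook forms, not the exact implementations of the R metrics package we used):

```python
def mape(actual, pred):
    # Mean absolute percentage error divides by each actual value,
    # so any actual of 0 blows up -- hence the NaN/Inf cells above.
    return 100 * sum(abs((a - p) / a) for a, p in zip(actual, pred)) / len(actual)

def smape(actual, pred):
    # Symmetric MAPE divides by the mean magnitude of actual and
    # forecast, which is zero only when both are zero.
    return 100 * sum(2 * abs(p - a) / (abs(a) + abs(p))
                     for a, p in zip(actual, pred)) / len(actual)

actual = [0, 2, 4, 3]          # target counts include zeros
pred   = [1.0, 2.5, 3.5, 3.0]  # hypothetical model predictions

print(smape(actual, pred))     # finite, despite the zero count
try:
    mape(actual, pred)
except ZeroDivisionError:
    print("MAPE undefined when an actual value is 0")
```

For this reason we compare the models on smape, mase, RMSE and AIC rather than mape/mpe.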
‘Total Sulfur Dioxide – Why it Matters, Too!’, Iowa State University Extension: https://www.extension.iastate.edu/wine/total-sulfur-dioxide-why-it-matters-too/